ABSTRACT

The advent of computers in educational and psychological
measurement has led to the need for algorithms for optimal assembly of tests
from item banks. This paper reviews the literature on optimal test assembly
and introduces the contributions to this report on the topic. Four different
approaches to computerized test assembly are discussed: heuristic-based test
assembly; 0-1 linear programming; network-flow programming; and an optimal
design approach. In addition, applications of these methods to a large
variety of problems are examined, including: (1) item response theory-based
test assembly; (2) classical test assembly; (3) assembling multiple test
forms; (4) item matching; (5) observed-score equating; (6) constrained
adaptive testing; (7) assembling tests with item sets; (8) item pool design;
and (9) assembling tests with multiple traits. This paper concludes with a
90-item bibliography on test assembly. (Contains three figures and seven
references.) (Author/SLD)
Abstract

The advent of computers in educational and psychological measurement has led to the need for
algorithms for optimal assembly of tests from item banks. This paper reviews the literature on
optimal test assembly and introduces the contributions to this special issue on the topic. Four
different approaches to computerized test assembly are discussed: heuristic-based test
assembly, 0-1 linear programming, network-flow programming, and an optimal design
approach. In addition, applications of these methods to a large variety of problems are
examined, including IRT-based test assembly, classical test assembly, assembling multiple
test forms, item matching, observed-score equating, constrained adaptive testing, assembling
tests with item sets, item pool design, and assembling tests with multiple traits. The paper
concludes with a bibliography on optimal test assembly.
Optimal Assembly of Educational and Psychological Tests, with a Bibliography

In his chapters on item response theory (IRT) in Lord and Novick (1968), Birnbaum
introduced a method of test assembly that was immediately acclaimed to be the proper 
approach to the problem. The method involves the following three steps: First, a goal for the 
test is formulated. Examples of possible goals are: admission decisions to an educational 
program, diagnosis of the skills of the students in the lower tail of a population distribution, or 
replacement of a test that has become obsolete by a parallel form. Second, the goal for the test 
is used to set a target for the test information function. Examples of such targets are given in 
Figure 1. Third, a test is assembled such that its information function matches the target. In
doing so, use is made of the fact that the item information functions are additive. Formal
definitions of the concepts of item and test information are given later in this paper.

[Insert Figure 1 about here]

In spite of its immediate recognition, it took a long time before Birnbaum's method
was actually used in the practice of test assembly. One reason for this delay was the fact that
the method could not be performed by hand. But even when computers became available, it
appeared difficult to formulate algorithms guaranteeing the optimality of a test assembled
from an item pool. Finally, and most importantly, in practice tests are seldom assembled only 
to match a target for their information function but also have to meet large sets of 
specifications dealing with such attributes as test content, item format, cognitive level, or 
section lengths. In the early days of computerized testing it was not known how to implement 
Birnbaum's method to meet such specifications as well. 

However, the formal structure of the above test assembly problem is not unique and 
can be found in many problems in industry, trade, commerce, and everyday life. Examples are 
the problems of putting together an investment portfolio, composing a diet, drafting a 
production schedule, packing a suitcase, or purchasing goods in a supermarket. The structure 
shared by these problems is the one of constrained combinatorial optimization (Nemhauser & 
Wolsey, 1988; Rao, 1985; Wagner, 1972). Each problem belonging to this class is 
characterized by the presence of a finite pool of "items" (e.g., stocks, nutrients, travel 
attributes) from which a combination has to be selected (e.g., portfolio, diet, contents of 
suitcase). The task is to select a combination of items that is optimal with respect to one 
attribute (e.g., maximum profit, maximum nutritional value, minimum weight) and at the 
same time meets a variety of constraints on other attributes of the problem (e.g., budget
available, minimum daily intake of vitamins and minerals, volume of suitcase). Problems of
combinatorial optimization have been studied in decision theory, operations research, 
statistics, and management science. 

To present test assembly as an example of constrained combinatorial optimization, an 
important distinction is made between the following two types of test specifications: 

1. Constraints. These specifications require a test attribute or a function of item
attributes to meet an upper and/or a lower limit. Constraints can be formulated
as mathematical (in)equalities.

2. Objectives. These specifications require a test attribute or a function of item
attributes to take a minimum or maximum value. Objectives can be formulated
as mathematical functions that are to be optimized.

A test assembly program is now defined as a combination of an objective with a set of 
constraints. An example of a small IRT-based test assembly program is given in Figure 2. 

[Figure 2 about here] 

Observe that this program has three different types of constraints: 

1. Constraints on categorical item attributes (e.g., content classification; use of 
graphics). These attributes partition the item pool, and the constraints hold for 
the distribution of the items over this partition. 

2. Constraints on quantitative item attributes (word counts; expected response 
times). Constraints of this type require a function of the attributes (usually a 
sum or an average) over a set of items to meet an upper or lower bound. 

3. Constraints on dependencies between items. Examples are constraints 
representing a relation of exclusion (mutually exclusive items) or inclusion 
between the items (e.g., items presented as sets with a common stimulus). 

In practice, test assembly problems may involve many more attributes than the five attributes
used in this example (for a catalogue, see van der Linden & Boekkooi-Timminga, 1989).

Each possible objective involves its own optimal combination of items for a given 
item pool. Test assembly programs can therefore optimize only one objective function at a 
time. On the other hand, the number of constraints is not limited by any a priori bound. The 
only requirement is that the set of constraints leave a non-empty set of feasible solutions, that
is, collections of items meeting each of the constraints. In principle, a large set of constraints
can do so, but an inadvertently chosen small set can already overconstrain the problem and
lead to infeasibility. Problems of infeasibility in test assembly models are analyzed in
Timminga and Adema (1996) and in the contribution by Timminga (1998) to this special
issue. 

Often, the same test assembly problem can be formulated as a variety of programs. For 
example, an important decision is whether to formulate a specification as an objective
or as a constraint. If, for a given item pool, the maximum value of the test information function
at $\theta_0$ is approximately known, the objective function in Figure 2 can be replaced by a constraint
that requires information at this point to be larger than a well-chosen lower bound. This 
replacement would allow another constraint to be formulated as the objective. Also, it is possible 
to join several test specifications into a weighted combination of functions of different item 
attributes and use this combination as an objective. Other choices emerge if a test assembly 
program is translated into a mathematical optimization model; examples of such choices will be 
met later in this paper. 



Basic Approaches 

To find a solution to a test assembly program, a computer algorithm is needed. Four 
different approaches to solving test assembly programs will be discussed. Each of these 
approaches is represented by one or two contributions to this special issue of the journal. The 
first approach is based on the use of an intuitively attractive heuristic. This approach does not 
involve any mathematical modeling of the assembly program but formulates an item-selection 
rule that is built into a computer program. In the second and third approaches, decision variables
for the selection of the items for the test are defined. The variables are used to model the 
assembly problem as a mathematical programming problem with an objective function and 
constraints. An algorithm is then used to solve the model for an optimal combination of 
values for the decision variables. The fourth approach is based on the optimal design approach 
in statistics. This approach does not involve any combinatorial optimization but calculates a 
distribution of parameter values over a theoretic range that would yield a test with an optimal 
value for an objective function. These four approaches, combinations of which are often used 
in practice, are now discussed in more detail. 

Heuristic-Based Test Assembly 

Most heuristics in the literature on test assembly are based on sequential item
selection. That is, they select one item at a time, and the selection process is stopped when a
sufficient number is reached. These heuristics also belong to a class known as greedy
heuristics in the optimization literature (e.g., Nemhauser & Wolsey, 1985, sect. II.5). The only
other class of heuristics that has received some interest in the test assembly literature consists of
those based on genetic algorithms (Michalewicz, 1994).

The basic nature of the greedy heuristic can be illustrated using the exemplary test
assembly program in Figure 2. Indices $i = 1, \ldots, I$ and $j = 1, \ldots, n$ are used to denote the items in the
pool and in the test to be assembled, respectively. Thus, $i_j$ is the index in the pool of the $j$th
item in the test. Suppose $j-1$ items have been selected; the indices of these items form the set
$S_{j-1} \equiv \{i_1, \ldots, i_{j-1}\}$. Therefore, $R_j \equiv \{1, \ldots, I\} \setminus S_{j-1}$ is the set of items in the pool from which the $j$th
item has to be selected. Finally, let $I_i(\theta)$ denote Fisher's information in item $i$ on the unknown
parameter $\theta$ (for a formal definition of this measure, see Lord, 1980, chap. 5).

If the test has to have maximum information at $\theta_0$, a greedy heuristic would select each
next item to have maximum information at this value. It would be based on the following
criterion:

$$i_j \equiv \arg\max_{t} \{ I_t(\theta_0) : t \in R_j \}. \tag{1}$$

To meet the categorical constraints in Figure 2, sets $R_j$ could be defined for each of the classes
of the partition defined by the attributes. Item selection could then cycle along these classes 
proportionally to the numbers needed from them. Constraints on quantitative attributes or 
dependencies between items are more difficult to deal with in heuristics. The contributions by 
Luecht (1998) and Sanders and Verschoor (1998) to this special issue are based on the use of 
a greedy heuristic. 
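
Criterion (1) can be sketched in a few lines of Python. The sketch is illustrative only: the two-parameter logistic (2PL) form of the item information function, the function names, and the five-item pool are assumptions made here and are not taken from the paper, and the categorical constraints of Figure 2 are ignored.

```python
import math

def item_information(a, b, theta):
    """Fisher information of a 2PL item (an assumed model) at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def greedy_assemble(pool, theta0, n):
    """Select n items one at a time, each with maximum information at theta0."""
    remaining = set(range(len(pool)))   # R_j: items still available
    selected = []                       # S_{j-1}: items chosen so far
    for _ in range(n):
        best = max(remaining, key=lambda t: item_information(*pool[t], theta0))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical pool of (discrimination a, difficulty b) pairs.
pool = [(1.0, -1.0), (1.5, 0.0), (0.8, 0.2), (2.0, 1.5), (1.2, 0.1)]
test_form = greedy_assemble(pool, theta0=0.0, n=3)
```

Because the criterion looks at one value of the ability parameter only, the selection reduces to sorting the pool by information at that value; the loop is kept to mirror the sets $R_j$ and $S_{j-1}$ used in the text.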

One of the first heuristics for IRT-based test assembly in the literature is given in 
Ackerman (1989; see also Wang & Ackerman, 1998). The heuristic has been designed to 
assemble a set of parallel test forms to meet a common target for their information functions 
but will be discussed here for the case of assembling a single form. It is assumed that test 
information is controlled at a series of discrete values $\theta_k$, $k = 1, \ldots, K$, where $T(\theta_k)$ is the target
value for the test information function at $\theta_k$. At each step, the heuristic first selects the value of $k$
for which the difference between current information and its target value is maximal. Then the
item with maximum information at this value is selected. Let $k_j$ denote the index of the value of
$\theta$ used to select item $j$. Then, for $j = 1, \ldots, n$, the item selection process cycles through the following
two criteria:

$$k_j = \arg\max_{s} \Big\{ T(\theta_s) - \sum_{i \in S_{j-1}} I_i(\theta_s) : s = 1, \ldots, K \Big\}, \tag{2}$$

$$i_j = \arg\max_{t} \{ I_t(\theta_{k_j}) : t \in R_j \}. \tag{3}$$
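
The cycle through criteria (2) and (3) can be sketched as follows for a single form. The item information values are supplied as a precomputed table `info[i][k]`; this table, the targets, and the function name are hypothetical and serve only to make the two steps concrete.

```python
def ackerman_select(info, targets, n):
    """Alternate between criteria (2) and (3) until n items are selected.

    info[i][k] is the information of item i at the k-th ability value;
    targets[k] is the target for the test information at that value."""
    K = len(targets)
    remaining = set(range(len(info)))
    selected = []
    for _ in range(n):
        # Criterion (2): ability value with the largest gap to its target.
        gaps = [targets[k] - sum(info[i][k] for i in selected) for k in range(K)]
        k_j = max(range(K), key=lambda k: gaps[k])
        # Criterion (3): most informative remaining item at that value.
        i_j = max(remaining, key=lambda t: info[t][k_j])
        selected.append(i_j)
        remaining.remove(i_j)
    return selected

# Hypothetical four-item pool evaluated at three ability values.
info = [[0.2, 0.6, 0.2],
        [0.5, 0.3, 0.1],
        [0.1, 0.3, 0.5],
        [0.1, 0.5, 0.3]]
form = ackerman_select(info, targets=[0.5, 1.0, 0.5], n=3)
```

In this toy run the middle ability value ends with slightly more information than its target, because each step looks only at the single value with the largest gap.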

A problem with Ackerman’s heuristic is that the test information function is likely to
overshoot its target for several $\theta$ values, a result typical of greedy heuristics. Luecht and Hirsch
(1992) present a heuristic of a more tempered nature. Like (2), their heuristic is based on the
difference between current information at $\theta_k$ and its target value. However, it divides the
difference by the remaining number of items to be selected, $n-j+1$:

$$\delta_j(\theta_k) = \Big[ T(\theta_k) - \sum_{i \in S_{j-1}} I_i(\theta_k) \Big] / (n-j+1). \tag{4}$$



The quantities $\delta_j(\theta_k)$ are used as target values for the information function in the selection of the
$j$th item:

$$i_j = \arg\min_{t} \Big\{ \sum_{k=1}^{K} w_j(\theta_k) \, | I_t(\theta_k) - \delta_j(\theta_k) | : t \in R_j \Big\}, \tag{5}$$

where the weights $w_j(\theta_k)$ in (5) are added to promote the selection of items contributing most at
$\theta$ values with large gaps between item information values and the targets. A more detailed
introduction to this heuristic and the way it deals with various types of constraints on item 
selection is given in the contribution by Luecht (1998) to this special issue. 
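
A minimal sketch of the selection rule in (4) and (5) is given below. The text does not specify how the weights are computed, so equal weights are used here as a stated assumption; the information table and targets are likewise hypothetical.

```python
def luecht_hirsch_select(info, targets, n, weights=None):
    """Pick, at each step, the item closest to the per-item target of (4)."""
    K = len(targets)
    weights = weights or [1.0] * K   # equal weights: an assumption of this sketch
    remaining = set(range(len(info)))
    selected = []
    for j in range(1, n + 1):
        current = [sum(info[i][k] for i in selected) for k in range(K)]
        # (4): remaining gap at each ability value, spread over n-j+1 items.
        delta = [(targets[k] - current[k]) / (n - j + 1) for k in range(K)]
        # (5): weighted absolute distance between item information and delta.
        i_j = min(remaining, key=lambda t: sum(
            weights[k] * abs(info[t][k] - delta[k]) for k in range(K)))
        selected.append(i_j)
        remaining.remove(i_j)
    return selected

# Hypothetical four-item pool evaluated at two ability values.
info = [[0.5, 0.1], [0.2, 0.2], [0.1, 0.5], [0.3, 0.25]]
form = luecht_hirsch_select(info, targets=[0.5, 0.5], n=2)
```

Dividing the gap by the number of items still to be selected means that no single item is asked to close the whole gap, which is what tempers the overshoot.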

The heuristic by Swanson and Stocking (1993) supposes that all test specifications 
have been formulated as constraints. The heuristic minimizes a weighted sum of expected 
deviations from the constraints. Constraint 5 in Figure 2 is taken as an example, where $w_i$ is
used to denote the number of words in item $i$. If the $j$th item is to be selected and item $t \in R_j$ is the
candidate, the expected number of words in the total test is defined as:

$$\sum_{i \in S_{j-1}} w_i + w_t + \frac{n-j}{I-j} \sum_{i \in R_j \setminus \{t\}} w_i. \tag{6}$$

The first term in (6) is equal to the number of words in the $j-1$ items already selected, the
second term is the number of words in candidate item $t$, and the last term is equal to $n-j$ times
the average number of words in the set of remaining items, $R_j \setminus \{t\}$, which has $I-j$ members. The
expression in (6) is thus derived under the assumption of choosing item $t$ and random sampling
of the rest of the items from set $R_j \setminus \{t\}$.

The Swanson-Stocking heuristic calculates these expected values for all constraints. It 
then calculates the extent to which these expectations violate the bounds in the constraints. 
Finally, a weighted sum of the deviations is calculated, and the item with the smallest value 
for the weighted sum is selected. The use of weights not only allows us to express preferences 
for constraints but is also necessary to compensate for scale differences between attributes and 
bounds. 
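
The three steps just described (expected values, deviations from the bounds, weighted sum) can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: the attribute values, bounds, weights, and function names are hypothetical, and each constraint is taken to be a pair of lower and upper bounds on a sum over items.

```python
def expected_total(values, selected, remaining, t, n):
    """Expression (6): total expected if candidate t is chosen and the other
    n-j items are sampled at random from the remaining items without t."""
    j = len(selected) + 1                       # position being filled
    rest = [i for i in remaining if i != t]     # R_j without candidate t
    mean_rest = sum(values[i] for i in rest) / len(rest) if rest else 0.0
    return sum(values[i] for i in selected) + values[t] + (n - j) * mean_rest

def deviation(expected, lower, upper):
    """Amount by which an expected value falls outside its bounds."""
    return max(0.0, lower - expected) + max(0.0, expected - upper)

def select_next(attributes, bounds, weights, selected, remaining, n):
    """Candidate with the smallest weighted sum of expected deviations."""
    def score(t):
        return sum(
            weights[name] * deviation(
                expected_total(values, selected, remaining, t, n), *bounds[name])
            for name, values in attributes.items())
    return min(remaining, key=score)

# Hypothetical example: one item already selected, test length 3.
words = [100, 50, 80, 60]
choice = select_next({"words": words}, {"words": (0, 222)}, {"words": 1.0},
                     selected=[0], remaining={1, 2, 3}, n=3)
```

With several attributes, the weights would also have to absorb the scale differences between, for example, word counts and response times.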

As already noted, the only alternatives to greedy heuristics proposed for test assembly
are those based on genetic algorithms (Verschoor, 1998). Genetic algorithms do not select
items sequentially. They start with a pool of candidate solutions for the full test that are
improved in a probabilistic way simulating an evolutionary process. A key feature of genetic
algorithms is that they have a nonzero probability of backtracking. Greedy heuristics, on the
contrary, make choices that are locally optimal but may end up with solutions that are
not globally optimal. These heuristics are therefore often followed by a second process in 
which some of the items in the solution are replaced by alternatives. For example, Ackerman 
(1989) recommends swapping items between multiple forms to improve the extent to which 
they are parallel. Likewise, Swanson and Stocking (1993) recommend a second stage in which 
items whose removal would result in a reduction of the weighted sum of deviations are 
replaced by more promising ones. 

0-1 Linear Programming 

As already noted, the critical difference between this approach and the previous one is 
the definition of decision variables to assign items from the pool to the test. These variables 
are used to model the objective as a mathematical function and the constraints as (in)equalities 
to be imposed on its optimization. An example is formulated for the test assembly program in
Figure 2.

Let $x_i$, $i = 1, \ldots, I$, be the variable to represent the decision whether ($x_i = 1$) or not ($x_i = 0$) to
assign item $i$ from the pool to the test. The sets of indices of the items in the pool on
knowledge of facts, applications, and with graphics will be denoted as $V_k$, $V_a$, and $V_g$,
respectively. In addition to the quantitative attribute $w_i$ for the number of words in item $i$, the
attribute $r_i$ is used for the expected response time on item $i$.

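The model is as follows:

$$\text{maximize} \quad \sum_{i=1}^{I} I_i(\theta_0) x_i \qquad \text{(maximum information at } \theta_0\text{)} \tag{7}$$

subject to

$$\sum_{i \in V_k} x_i \le 10, \qquad \text{(knowledge of facts)} \tag{8}$$

$$\sum_{i \in V_a} x_i \ge 10, \qquad \text{(applications)} \tag{9}$$

$$\sum_{i \in V_g} x_i = 5, \qquad \text{(graphics)} \tag{10}$$

$$\sum_{i=1}^{I} x_i = 25, \qquad \text{(test length)} \tag{11}$$

$$\sum_{i=1}^{I} w_i x_i \ge 1{,}500, \qquad \text{(word counts)} \tag{12}$$

$$\sum_{i=1}^{I} r_i x_i \le 60, \qquad \text{(expected response times)} \tag{13}$$

$$x_{64} + x_{65} \le 1, \qquad \text{(mutually exclusive items)} \tag{14}$$

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, I. \qquad \text{(range of variables)} \tag{15}$$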

Since the variables are zero-one, the sum in (7) is the information in the test at $\theta_0$. Likewise,
the sums in (12) and (13) are the total number of words and the expected response time for the
test, respectively. In (8)-(10), the sums of variables are the numbers of items in the test from the
various sets; in (11) this sum represents the length of the test.
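
To make the roles of the decision variables concrete, the sketch below solves a toy version of the model exactly by checking every subset of the required length. Exhaustive enumeration is a stand-in chosen here for illustration and is feasible only for very small pools; it is not how LP software works. Only the objective (7) and constraints of the types in (8), (11), and (13) are included, and all numbers are hypothetical.

```python
from itertools import combinations

def solve_by_enumeration(info, times, facts, test_length, max_facts, max_time):
    """Best feasible item subset of a tiny 0-1 test assembly model."""
    best_value, best_items = None, None
    # Fixing the subset size enforces the test-length constraint, as in (11).
    for items in combinations(range(len(info)), test_length):
        # Categorical constraint, as in (8): limit on knowledge-of-facts items.
        if sum(1 for i in items if i in facts) > max_facts:
            continue
        # Quantitative constraint, as in (13): total expected response time.
        if sum(times[i] for i in items) > max_time:
            continue
        # Objective (7): test information at the chosen ability value.
        value = sum(info[i] for i in items)
        if best_value is None or value > best_value:
            best_value, best_items = value, items
    return best_value, best_items

# Hypothetical five-item pool; items 0 and 1 measure knowledge of facts.
info = [0.9, 0.8, 0.7, 0.4, 0.3]
times = [30, 25, 20, 10, 5]
value, items = solve_by_enumeration(info, times, facts={0, 1},
                                    test_length=3, max_facts=1, max_time=60)
```

The two most informative items cannot be selected together here because both belong to the knowledge-of-facts class, which shows how a categorical constraint changes the optimum.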

The expressions in (7)-(14) are linear in the variables. The constraints in (15) are
technical constraints that define the range of the variables. The optimization problem therefore
belongs to 0-1 linear programming (LP). Optimal values for the decision variables $x_i$, $i = 1, \ldots, I$,
can be found using standard LP software or a dedicated test assembly software package such 
as ConTEST (Timminga, van der Linden & Schweizer, 1996). Exact solutions to 0-1 LP 
problems are obtained through a complete branch-and-bound (B&B) search. Such problems are
known to be NP-hard; that is, no algorithm is known whose solution time is bounded by a
polynomial in the size of the problem. Exact solutions of large problems may therefore require
an excessive amount of
time. However, solutions with values for the objective function differing from the optimum by
a predetermined, negligibly small factor can easily be obtained for item pools of a realistic
size. An algorithm for doing so is described by Adema, Boekkooi-Timminga and van
der Linden (1991; see also Timminga, van der Linden & Schweizer, 1996, sect. 6.6.5). The
algorithm fixes some of the decision variables using a result in Crowder, Johnson and Padberg
(1983). In addition, the value of the objective function in the solution to the relaxed problem,
that is, with the 0-1 variables replaced by variables that can take values in [0,1], is employed
to derive a stopping rule for a B&B search for the solution in the original problem. Fan (1997)
used the algorithm to assemble six parallel forms of 60 items, each with approximately 200
constraints, from a pool of nearly 3,000 items within 11 minutes.
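
The idea of using the relaxed problem as a stopping rule can be illustrated with a deliberately small branch-and-bound search. The sketch is not the algorithm of Adema, Boekkooi-Timminga and van der Linden; it assumes a single upper-bound constraint with positive coefficients (for example, total response time), for which the relaxed problem can be solved by filling the remaining room in order of value per unit of weight. All names and numbers are hypothetical.

```python
def relaxed_bound(values, weights, capacity, fixed):
    """Objective value of the relaxed problem (free variables in [0, 1]).

    `fixed` maps item index -> 0 or 1 for variables already decided.
    Returns None when the fixed part alone exceeds the capacity."""
    used = sum(weights[i] for i, x in fixed.items() if x == 1)
    if used > capacity:
        return None
    bound = sum(values[i] for i, x in fixed.items() if x == 1)
    room = capacity - used
    free = [i for i in range(len(values)) if i not in fixed]
    for i in sorted(free, key=lambda i: values[i] / weights[i], reverse=True):
        if room <= 0:
            break
        take = min(1.0, room / weights[i])   # at most one item is split
        bound += take * values[i]
        room -= take * weights[i]
    return bound

def branch_and_bound(values, weights, capacity, tolerance=0.0):
    """Depth-first B&B that stops once within `tolerance` of the root bound."""
    n = len(values)
    root = relaxed_bound(values, weights, capacity, {})
    best_value, best_x = 0.0, [0] * n
    stack = [{}]
    while stack:
        fixed = stack.pop()
        bound = relaxed_bound(values, weights, capacity, fixed)
        if bound is None or bound <= best_value:
            continue                          # infeasible or cannot improve
        if len(fixed) == n:                   # all variables decided
            best_value, best_x = bound, [fixed[i] for i in range(n)]
            if best_value >= (1.0 - tolerance) * root:
                break                         # stopping rule from the relaxation
            continue
        i = len(fixed)                        # branch on the next item
        stack.append({**fixed, i: 0})
        stack.append({**fixed, i: 1})
    return best_value, best_x

values = [0.6, 0.5, 0.4, 0.3]     # hypothetical item information
weights = [30, 20, 25, 10]        # hypothetical response times
value, x = branch_and_bound(values, weights, capacity=60)
```

Since the relaxed optimum is never smaller than the 0-1 optimum, stopping when the incumbent is within a fraction `tolerance` of the root bound guarantees a solution within that fraction of the true optimum.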

To the knowledge of the author, the first to apply linear programming to model a 
problem in testing was Votaw (1952). Feuermann and Weiss (1973) used the technique to 
solve a test assembly program. The application of linear programming to test assembly
was also alluded to in Yen (1983). A seminal paper was the one by Theunissen (1985), who
modeled Birnbaum's problem of a test to meet a target information function as a 0-1 LP
problem. This paper stimulated others to use the same methodology to model a large variety
of other test assembly problems (see the papers by Adema, Baker, Boekkooi-Timminga,
Boomsma, de Gruijter, Gademann, Glas, Kester, Razoux Schultz, Timminga, and van der
Linden in the bibliography at the end of this paper). In this special issue, the paper by van der
Linden and Reese (1998) demonstrates the use of 0-1 LP to build item selection constraints 
into an algorithm for computerized adaptive testing. 

Network-Flow Programming 

Integer programming problems are defined as LP problems with decision variables that 
can take a larger range of integer values than just the values of 0 and 1. In special cases, 
integer problems take the form of a network-flow or transportation problem. If so, quick 
solutions to large problems are possible. An example of a problem with a network-flow 
structure is given by the directed graph in Figure 3. Nodes $S_i$ on the left-hand side are supply
nodes; nodes $D_j$ on the right-hand side are demand nodes. The directed arcs or arrows indicate a
flow or transportation from the supply to the demand nodes. For each arc there is a decision
variable $x_{ij}$ denoting the units of flow from node $S_i$ to $D_j$. The constraints in a network-flow
problem deal with the numbers of units available at the supply nodes, the bounds on the
numbers needed at the demand nodes, or the costs associated with a unit of flow along the
arc from $i$ to $j$, $c_{ij}$. If the number of supply nodes is equal to the number of demand nodes and
the decision variables take only the values 0 and 1, network-flow problems are known as
assignment problems. Also, transshipment nodes can be added between the supply and demand
nodes to accommodate a larger class of problems. Transshipment nodes have both demand
and supply constraints associated with them.

[Figure 3 about here]

An important result in network-flow programming is that among the solutions to the 
relaxed or continuous version of the problem there is always one with integer values for the 
variables. This solution is found by the well-known simplex algorithm in LP. Moreover, the 
structure of the network-flow problems allows for an efficient implementation of the simplex 
algorithm resulting in solution times for large problems that seldom take more than seconds 
on a personal computer. 

Some test assembly problems can be formulated as network-flow problems. For 
example, suppose that for $i = 1, \ldots, m$ supply nodes $S_i$ represent the items in the example in
Figure 2 that measure knowledge of facts whereas for $i = m+1, \ldots, I$, they represent the items that
do not measure at this cognitive level. In addition, demand nodes $D_j$, $j = 1, 2$, represent the sets
of items needed in the test form that do and do not measure knowledge of facts, respectively.
The decision variables $x_{ij}$ denote whether ($x_{ij} = 1$) or not ($x_{ij} = 0$) item $i$ is shipped to the part of
the test represented by demand node $D_j$. Finally, the "cost" of shipping item $i$ to $D_j$ is defined
as its information at $\theta_0$, $I_i(\theta_0)$ (changing the problem from a minimization into a maximization
problem). The test assembly problem consisting of the objective and the first constraint in Figure
2 can be modeled as the following network-flow problem:

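$$\text{maximize} \quad \sum_{i=1}^{I} \sum_{j=1}^{2} I_i(\theta_0) x_{ij} \qquad \text{(maximum information at } \theta_0\text{)} \tag{16}$$

subject to

$$\sum_{j=1}^{2} x_{ij} \le 1, \quad i = 1, \ldots, I, \qquad \text{(supply at } S_1, \ldots, S_I\text{)} \tag{17}$$

$$\sum_{i=1}^{m} x_{i1} = 10, \qquad \text{(demand at } D_1\text{)} \tag{18}$$

$$\sum_{i=m+1}^{I} x_{i2} = 15, \qquad \text{(demand at } D_2\text{)} \tag{19}$$

$$x_{ij} \in \{0, 1\}, \quad i = 1, \ldots, I, \quad j = 1, 2, \qquad \text{(range of variables)} \tag{20}$$

where $x_{i1} = 0$ for $i > m$ and $x_{i2} = 0$ for $i \le m$.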
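
In this particular example every item can be shipped to only one demand node, so the problem separates into one selection per node and can be solved by filling each node with its most informative eligible items. The sketch below exploits that special structure; it is not a general network-flow solver, and the data and names are hypothetical.

```python
def assemble_by_class(info, class_of, demand):
    """Fill each demand node with its most informative eligible items.

    Valid only when every item is eligible for exactly one demand node,
    as in the two-node example of (16)-(20)."""
    test = {}
    for node, needed in demand.items():
        eligible = [i for i in range(len(info)) if class_of[i] == node]
        if len(eligible) < needed:
            raise ValueError(f"demand at node {node} cannot be met")
        eligible.sort(key=lambda i: info[i], reverse=True)
        test[node] = eligible[:needed]
    return test

# Hypothetical pool: items 0-2 measure knowledge of facts (node 1), 3-5 do not.
info = [0.5, 0.9, 0.2, 0.7, 0.4, 0.6]
class_of = [1, 1, 1, 2, 2, 2]
test = assemble_by_class(info, class_of, demand={1: 2, 2: 1})
```

When classes overlap or transshipment nodes are present, this shortcut no longer applies and a network simplex or other minimum-cost flow routine would be needed.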

Most test assembly problems with categorical attributes can be modeled as network- 
flow problems with demand nodes representing classes of items defined by combinations of 
attributes. Since these classes need not form a partition of the item bank and transshipment 
nodes can be added, flexibility is large. The fact that realistic problems typically may involve 
thousands of variables (number of items times number of demand nodes) need not bother us; 
such network-flow problems can generally be solved quickly. 

However, problems with quantitative attributes are more difficult to model. One 
approach is to embed the network-flow problem in a heuristic, for example, using Lagrangian 
relaxation. In this technique, all quantitative constraints are removed from the constraint set 
and added to the objective function as penalty terms, each multiplied by a Lagrange multiplier. For
example, Constraint 5 in Figure 2 is added to the objective function in (16) as:

$$\text{maximize} \quad \sum_{i=1}^{I} \sum_{j=1}^{2} I_i(\theta_0) x_{ij} - \lambda \Big( 1{,}500 - \sum_{i=1}^{I} w_i \sum_{j=1}^{2} x_{ij} \Big). \tag{21}$$

A solution is typically found by cycling through the process of finding a suitable value for $\lambda$,
solving the network-flow problem, and improving on the current value of $\lambda$ until a satisfactory
result is obtained. Results are usually still quick and near optimal but may suffer from constraint
violation.
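
The cycle described above can be sketched as follows. Several simplifications are assumptions of this sketch: the word-count constraint is treated as a lower bound, as the sign of the penalty term in (21) suggests; the relaxed problem is reduced to picking the $n$ items with the largest adjusted values, standing in for a network-flow solver; and the multiplier is raised by a fixed step times the remaining shortfall. All names and numbers are hypothetical.

```python
def lagrangian_assemble(info, words, n, min_words, step=1e-4, max_iter=100):
    """Alternate between solving the relaxed problem and updating lambda."""
    lam = 0.0
    chosen = []
    for _ in range(max_iter):
        # Relaxed problem for fixed lambda: each item contributes its
        # information plus lambda times its word count.
        adjusted = [info[i] + lam * words[i] for i in range(len(info))]
        ranked = sorted(range(len(info)), key=lambda i: adjusted[i], reverse=True)
        chosen = sorted(ranked[:n])
        shortfall = min_words - sum(words[i] for i in chosen)
        if shortfall <= 0:
            break                        # word-count bound is met
        lam += step * shortfall          # raise the penalty and try again
    return chosen, lam

# Hypothetical four-item pool.
info = [0.9, 0.8, 0.7, 0.3]
words = [10, 20, 60, 90]
chosen, lam = lagrangian_assemble(info, words, n=2, min_words=70)
```

If the iteration limit is reached first, the function returns the last selection even though it violates the bound, which corresponds to the constraint violation mentioned in the text. For an upper bound on the word count the sign of the adjustment would be reversed.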

Test assembly problems with constraints representing dependencies between items in
the pool cannot always be formulated as network-flow problems either. However, the same
approach of embedding a reduced problem in a larger heuristic can be followed to attack such 
problems. 

An excellent review of network-flow programming models with Lagrangian relaxation 
for test assembly is given in Armstrong, Jones and Wang (1995). Nearly all of their empirical 
examples have calculation times less than 2 mins. In the contribution by Armstrong, Jones and 
Kunce (1998) to this special issue, the same technique is used to assemble a series of parallel 
test forms. Other applications are given in the papers by Armstrong et al., Boomsma, and 
Veldkamp in the bibliography. 

Optimal Design Approach

The final approach reviewed here is based on the theory of optimal experimental 
design developed in statistics (e.g., Fedorov, 1972). One of the first problems addressed in 
optimal design theory was the designing of an experiment for estimating the parameters in a 
linear regression model. The standard approach in optimal design theory is to choose a set of 
design points (=grid of values for the independent variables) and find an experimental design 
(=distribution of observations over these points) that would result in optimal accuracy of the 
parameter estimates. Since most experiments have multiple parameters, the criterion of 
optimality is typically defined on the variance-covariance matrix of the estimators. Popular 
functions are the determinant, the trace, and the largest eigenvalue of this matrix; solutions with
optimal values for these criteria are known as D-, A-, and E-optimal, respectively.
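
For a model with two parameters the three functions can be computed in closed form. The sketch below evaluates them for two hypothetical variance-covariance matrices; the matrices and the function name are illustrative assumptions, and smaller values indicate more accurate estimation under each criterion.

```python
import math

def design_criteria(cov):
    """Determinant, trace, and largest eigenvalue of a symmetric 2x2 matrix."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    trace = a + d
    # Eigenvalues of a symmetric 2x2 matrix: trace/2 +/- sqrt((trace/2)^2 - det).
    largest_eig = trace / 2.0 + math.sqrt((trace / 2.0) ** 2 - det)
    return det, trace, largest_eig

# Two hypothetical designs with the same trace but different correlation.
design_1 = [[2.0, 0.0], [0.0, 2.0]]
design_2 = [[2.0, 1.0], [1.0, 2.0]]
criteria_1 = design_criteria(design_1)
criteria_2 = design_criteria(design_2)
```

The two designs tie on the trace, the first is better on the largest eigenvalue, and the second is better on the determinant, so the choice of criterion can change which design is preferred.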

Since IRT models can be viewed as regression models, it seems obvious to apply the
techniques of optimal design to parameter estimation problems in IRT. Applications consist
of optimal design of experiments for estimating the item as well as the examinee parameters.
The latter is the problem of optimal test design. A solution to the problem is a joint 
distribution of the item parameter values with optimal accuracy for the ability estimator. 
However, unlike standard regression models, IRT models are nonlinear and have unobserved 
independent values. How to deal with these issues is explained in the reviews of optimal 
design approaches to IRT by Berger (1997) and van der Linden (1994b). The contribution by 
Berger (1998) to this special issue of the journal applies optimal design techniques to tests 
with dichotomous and polytomous item formats. Other applications of optimal test design are 
given in the papers by Berger et al. in the bibliography. 

Discussion 

Important yardsticks to evaluate the appropriateness of the various approaches to test
assembly problems are: (1) ease of modeling the problem; (2) optimality of the solution;
(3) possibility of constraint violation; and (4) computer time needed. A heuristic approach is
generally quicker than all other approaches. However, its solutions are mostly suboptimal to
an extent that remains unknown and may violate some of the constraints. As already observed,
the use of heuristics does not involve any modeling, but for new problems it usually takes a
considerable amount of time to adjust the heuristic, for example, to find the best weights if the
objective is to minimize a sum of weighted deviations from a large set of constraints.

The strong advantage of the 0-1 LP approach is its flexibility. Most assembly problems 
can be modeled using 0-1 integer variables. Also, modeling is the only thing needed; once a 
model has been formulated, it is not necessary to design a heuristic or adjust software. 
Constraint violation is impossible. However, the approach does have an important tradeoff 
between the speed and optimality of its solutions. Exact solutions for larger problems are not 
possible in realistic time, but if an appropriate search algorithm is used, near-optimal solutions 
to practical problems, with values for the objective function 1-2% from its optimum, say, are 
often possible in minutes. 

The power of network-flow programming is its speed. If the test assembly problem can 
be formulated to have the special structure of a network-flow problem, exact solutions to large 
problems are possible in seconds. If not, the method has to be embedded in a heuristic 
approach. Typically, solutions then still seldom take more than a few minutes but are only near
optimal and may show occasional constraint violation.

The optimal design approach differs from the others in several aspects. Its intention is
to calculate the best distribution of the item parameter values over their theoretical range
given a criterion of optimality. Test specifications other than this objective are generally
ignored. Optimal design is thus not a method for assembling a test from a finite, existing
pool of items. However, its optimal distribution of item parameter values should be 
approximated in practice. In principle, it is even possible to build this distribution as a target 
in a 0-1 LP or network-flow model for test assembly. 

Applications 

A large variety of test assembly problems have been attacked using the approaches discussed 
in this paper. Applications range from the problem of assembling a set of multiple test forms 
simultaneously to observed-score equating and constrained adaptive testing. The most 
important results are now reviewed. 

Multiple forms. The first extension of the problem of finding an optimal single test
form was the one of assembling a set of parallel forms. An obvious approach to the problem 
of multiple-form assembly may seem to be to apply the above approaches sequentially until the
desired number of forms is obtained. However, this approach would select the best items first
and show a decrease in the quality of the test forms. Therefore, simultaneous assembly of
multiple test forms is a better alternative. 
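
The decrease in quality under sequential assembly is easy to reproduce. In the sketch below each form simply takes the most informative items still available; the information values and the function name are hypothetical, and all other constraints are ignored.

```python
def assemble_sequentially(info, n_forms, form_length):
    """Build forms one after another, each taking the best items left."""
    available = sorted(range(len(info)), key=lambda i: info[i], reverse=True)
    forms = []
    for f in range(n_forms):
        forms.append(available[f * form_length:(f + 1) * form_length])
    return forms

# Hypothetical information values at a single ability value.
info = [0.9, 0.4, 0.8, 0.5, 0.7, 0.6]
forms = assemble_sequentially(info, n_forms=2, form_length=3)
totals = [sum(info[i] for i in form) for form in forms]
```

In this toy pool the first form collects far more information than the second, so the two forms are not parallel.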

As shown in Boekkooi-Timminga (1987a), a simultaneous approach involves
replacing the decision variables in the model in (7)-(15) by variables $x_{if}$ denoting the decisions
whether ($x_{if} = 1$) or not ($x_{if} = 0$) item $i$ in the pool will be assigned to form $f = 1, \ldots, F$. In addition,
a set of constraints has to be added to prevent items from being assigned to more than one
form:

$$\sum_{f=1}^{F} x_{if} \le 1, \quad i = 1, \ldots, I. \tag{22}$$



Because the number of decision variables is equal to the size of the item pool times the 
number of forms, the approach is only possible for smaller problems. All developments for 
realistic multiple-form problems therefore have heuristic aspects. 
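
As a rough illustration of the difference between sequential and simultaneous assembly, the 
following Python sketch enumerates all assignments for a tiny, made-up pool. The item 
information values, the function names, and the maximin objective (maximize the information of 
the weakest form) are assumptions for illustration and are not taken from the model in (7)-(15). 
Exhaustive search stands in for a 0-1 LP solver and is feasible only for very small pools. 

```python
from itertools import combinations


def assemble_sequential(info, n_forms, length):
    """Baseline: build forms one at a time, each taking the best remaining items."""
    remaining = sorted(range(len(info)), key=lambda i: info[i], reverse=True)
    forms = []
    for _ in range(n_forms):
        forms.append(remaining[:length])
        remaining = remaining[length:]
    return forms


def assemble_simultaneous(info, n_forms, length):
    """Exhaustive search over disjoint forms (each item in at most one form),
    maximizing the information of the weakest form (assumed maximin objective)."""
    best, best_value = None, float("-inf")

    def extend(forms, available):
        nonlocal best, best_value
        if len(forms) == n_forms:
            value = min(sum(info[i] for i in form) for form in forms)
            if value > best_value:
                best, best_value = [list(form) for form in forms], value
            return
        for form in combinations(sorted(available), length):
            # Symmetry breaking: order forms by their smallest item index.
            if forms and form[0] < forms[-1][0]:
                continue
            extend(forms + [form], available - set(form))

    extend([], set(range(len(info))))
    return best


# Hypothetical item information values at a single ability point.
info = [9, 8, 7, 6, 5, 4]
sequential = assemble_sequential(info, n_forms=2, length=2)
simultaneous = assemble_simultaneous(info, n_forms=2, length=2)
weakest_sequential = min(sum(info[i] for i in form) for form in sequential)
weakest_simultaneous = min(sum(info[i] for i in form) for form in simultaneous)
```

In this toy pool the sequential approach gives the first form the two best items, so the second 
form is weaker, whereas the simultaneous search balances the two forms. 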

Adema (1992b) designed an approach in which the problem of assembling a set of 
parallel forms simultaneously is replaced by a series of computationally less intensive 
two-form problems. A generalization of the approach to any set of test forms is given in van der 
Linden and Adema (1998). Other methods of assembling multiple test forms are proposed in 
Adema (1992), Boekkooi-Timminga (1990a; 1990b) and van der Linden and Carlson (1997). 
Solutions based on item matching are given in Armstrong, Jones, Li and Wu (1996), Armstrong, 
Jones and Wu (1992) and in the contribution by Armstrong, Jones and Kunce (1998) to this 
special issue. The principle of item matching used in the Armstrong et al. papers is explained 
below. 

Item sets. A popular testing format is the one with sets of items related to a common 
stimulus, for example, a text passage in a reading test or a description of an experiment in a 
physics test. If each item set in the pool has to remain intact when selected for a test, an obvious 
approach is to attach aggregated item attributes as descriptors to the item sets and model the 
problem using 0-1 decision variables for the selection of sets. The problem becomes more 
complicated, though, if the number of items to be selected per set has to be smaller than the 
number in the pool, in particular if the selection also has to satisfy separate sets of constraints 
on item, test, and stimulus attributes. 

A flexible solution is possible using different decision variables for the stimuli and the 
items (van der Linden, 1992). Let $s=1,\ldots,S$ denote the stimuli in the pool and 
$i_s=1,\ldots,I_s$ the items nested under stimulus $s$. These indices can be used to define 0-1 
decision variables $z_s$ and $x_{i_s}$ for the selection of the stimuli and items, respectively. The 
same variables are then available to model the various specifications at item, test, and stimulus 
level. They also allow for the simultaneous selection of stimuli and items provided the following 
constraint set is added to the model: 

$$\sum_{i_s=1}^{I_s} x_{i_s} - n_s z_s = 0, \qquad s = 1, \ldots, S. \tag{23}$$

The purpose of these constraints, which can be replaced by inequalities, is not only to keep the 
selection of stimuli and items consistent but also to set the number of items selected per set 
equal to $n_s$. 
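
The role of the constraints in (23) can be sketched in a few lines of Python. The example below 
is a minimal illustration under stated assumptions: the information values, the function names, 
and the simple additive objective are hypothetical, and exhaustive enumeration is used instead of 
a 0-1 LP solver. 

```python
from itertools import combinations, product


def satisfies_constraint_23(z, x, n_per_set):
    """Check sum_i x[s][i] - n_per_set[s] * z[s] == 0 for every stimulus s."""
    return all(sum(x[s]) - n_per_set[s] * z[s] == 0 for s in range(len(z)))


def select_item_sets(info, n_per_set, n_stimuli):
    """Pick n_stimuli stimuli and exactly n_per_set[s] items under each selected
    stimulus, maximizing the sum of (hypothetical) item information values."""
    n_total = len(info)
    best, best_value = None, float("-inf")
    for stimuli in combinations(range(n_total), n_stimuli):
        choices = [combinations(range(len(info[s])), n_per_set[s]) for s in stimuli]
        for items in product(*choices):
            value = sum(info[s][i] for s, chosen in zip(stimuli, items) for i in chosen)
            if value > best_value:
                z = [1 if s in stimuli else 0 for s in range(n_total)]
                x = [[0] * len(info[s]) for s in range(n_total)]
                for s, chosen in zip(stimuli, items):
                    for i in chosen:
                        x[s][i] = 1
                best, best_value = (z, x), value
    return best, best_value


# info[s][i]: made-up information of item i nested under stimulus s.
info = [[5, 3, 4], [2, 6, 1], [4, 4, 4]]
n_per_set = [2, 2, 2]
(z, x), value = select_item_sets(info, n_per_set, n_stimuli=2)
```

Because items are only enumerated under selected stimuli, every solution returned satisfies 
(23) by construction; the checker makes that explicit. 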

Classical test assembly. A basic problem in test assembly based on classical item and 
test parameters is that, unlike in IRT, no meaningful test parameters can be found that are 
additive in the items. In particular, test reliability is a nonlinear function of the covariances 
between all pairs of items, and, as a consequence, an attempt to assemble a test with 
maximum reliability may involve a procedure with endless backtracking. 

The problem of nonlinearity is illustrated for the maximization of Cronbach's alpha. 
Adding decision variables, the objective function is 

$$\text{maximize} \quad \alpha = \frac{n}{n-1}\left[1 - \frac{\sum_{i=1}^{I} \sigma_i^2 x_i}{\left(\sum_{i=1}^{I} \rho_i \sigma_i x_i\right)^2}\right], \tag{24}$$

where $\sigma_i$ and $\rho_i$ are the item standard deviation and item-test correlation, respectively. 
However, if the test length, $n$, is fixed, the objective is equivalent to the one of minimizing the 
ratio in the second factor. Also, both sums appearing in this ratio are linear in the decision 
variables. Adema and van der Linden (1989) presented an LP solution in which the sum of 
$\rho_i \sigma_i x_i$ is maximized and the sum of $\sigma_i^2 x_i$ is constrained not to exceed a 
well-chosen small bound, $c$: 

$$\text{maximize} \quad \sum_{i=1}^{I} \rho_i \sigma_i x_i \tag{25}$$

subject to 

$$\sum_{i=1}^{I} \sigma_i^2 x_i \le c. \tag{26}$$



Simulation studies with this linearized version of Cronbach's alpha showed near-optimal 
results under a large variety of conditions. Armstrong, Jones and Wang (1994) extended the 
approach by building the constraint in (26) into the objective function in (25) using 
Lagrangian relaxation and embedding the new objective function into an algorithm that 
optimized the choice of c. 
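
To make the linearization concrete, the sketch below compares a direct maximization of alpha 
with the linearized problem on a tiny, made-up pool. It assumes the usual expression of alpha in 
terms of item standard deviations and item-test correlations, a fixed test length, and exhaustive 
enumeration in place of an LP solver; the data, the bound c, and all function names are 
hypothetical. 

```python
from itertools import combinations


def alpha(sigma, rho, items):
    """Cronbach's alpha from item standard deviations and item-test correlations."""
    n = len(items)
    numerator = sum(sigma[i] ** 2 for i in items)
    denominator = sum(sigma[i] * rho[i] for i in items) ** 2
    return n / (n - 1) * (1 - numerator / denominator)


def max_alpha_exhaustive(sigma, rho, n):
    """Reference solution: try every test of length n and keep the best alpha."""
    return max(combinations(range(len(sigma)), n),
               key=lambda items: alpha(sigma, rho, items))


def max_alpha_linearized(sigma, rho, n, c):
    """Linearized problem: maximize sum(rho_i * sigma_i) subject to
    sum(sigma_i ** 2) <= c and a fixed test length n."""
    feasible = [items for items in combinations(range(len(sigma)), n)
                if sum(sigma[i] ** 2 for i in items) <= c]
    if not feasible:
        return None
    return max(feasible, key=lambda items: sum(rho[i] * sigma[i] for i in items))


# Made-up classical item parameters.
sigma = [0.5, 0.4, 0.45, 0.3, 0.5]
rho = [0.6, 0.5, 0.3, 0.55, 0.4]
exact = max_alpha_exhaustive(sigma, rho, n=3)
linearized = max_alpha_linearized(sigma, rho, n=3, c=0.6)
```

The linearized solution can never exceed the exhaustive one in alpha; how close it comes 
depends on the choice of c, which is the quantity tuned in the extension described above. 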

Item matching. Problems of item matching arise if a set of test forms has to be 
assembled that are indistinguishable item by item. The first application of optimal test 
assembly methods to such problems was the use of 0-1 LP to find optimally matched test 
halves for estimating split-half reliability (van der Linden and Boekkooi-Timminga, 1988). 
The same problem has been addressed using network-flow programming in Armstrong and 
Jones (1992) and using a greedy heuristic in the contribution by Sanders and Verschoor (1998) 
to this special issue. 

A related problem is the one of assembling a set of test forms to be parallel to an old 
form, which is addressed in Armstrong, Jones, Li and Wu (1996), Armstrong, Jones and Wu (1992) 
and in the contribution by Armstrong, Jones and Kunce (1998) to this special issue. Network-flow 
programming is a natural approach to this problem because the items in the reference test can 
serve as demand nodes to which items for the set of forms are shipped at costs that are a 
function of the match between the items and the target (see Figure 3). Once the items have 
been shipped, a heuristic is used to assign the items from the demand nodes to the individual 
test forms. 
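
The matching idea can be sketched for the simplest case of a single new form, where each 
reference item receives exactly one pool item. The code below is an illustration only: it solves 
the resulting assignment problem by brute force rather than by a network-flow algorithm, and the 
item parameters, the squared-distance cost, and the function name are assumptions. 

```python
from itertools import permutations


def match_items(reference, pool):
    """Assign one distinct pool item to each reference item so that the total
    squared distance between their parameter vectors is minimal."""

    def cost(ref_item, pool_item):
        return sum((u - v) ** 2 for u, v in zip(ref_item, pool_item))

    best, best_cost = None, float("inf")
    for assignment in permutations(range(len(pool)), len(reference)):
        total = sum(cost(reference[k], pool[i]) for k, i in enumerate(assignment))
        if total < best_cost:
            best, best_cost = list(assignment), total
    return best, best_cost


# Hypothetical (discrimination, difficulty) pairs.
reference = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
pool = [(0.8, 1.0), (1.5, 0.5), (1.0, -1.0), (0.9, -0.2), (1.2, 0.0)]
assignment, total_cost = match_items(reference, pool)
```

Here `assignment[k]` is the index of the pool item matched to reference item `k`. For several 
forms, each reference item would demand one item per form, which is where a network-flow 
formulation and a subsequent assignment heuristic become useful. 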

Observed-score equating. In large-scale testing programs, old test forms are 
periodically replaced by new ones. The traditional approach is to assemble a new form, pretest 
its items, and equate the observed scores on the new form to those on the old form. An 
alternative would be to assemble the new form to have the same observed-score distribution 
as the old form for a population of examinees. The idea was explored in van der Linden and 
Luecht (1996) using a 0-1 LP model that matched both the test information and the test 
characteristic function of the new form to those of the old form, the idea being that these two 
functions would equate the error- and true-score distributions of the new form, and thereby its 
observed-score distribution. The same idea is used in Glas (1988) to equate cutscores on a 
new and old form and in the contribution by Armstrong, Jones and Kunce (1998) to this 
special issue. 

In a later paper (van der Linden & Luecht, in press), it is proved that the observed-score 
distributions on two test forms are equal if and only if 

$$\sum_{i=1}^{n} P_i^r(\theta) = \sum_{j=1}^{n} P_j^r(\theta), \qquad \text{for } r = 1, \ldots, n, \tag{27}$$

where $P_i(\theta)$ and $P_j(\theta)$ are the response functions of items $i$ and $j$ in the new and old 
form, respectively. The result is based on a series expansion, and in practice only a few lower-order 
equalities need to be met to get good results. Since the equalities are linear in the items, they can 
easily be built into a 0-1 LP model for assembling the new form. An empirical example for the 
LSAT gave excellent results for a model that only had the equalities in (27) for $r=1,2,3$. 
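
The quantities involved can be computed directly at a fixed ability value. The sketch below 
evaluates the power sums that appear in (27) and the conditional number-correct distribution, the 
latter with the standard Lord-Wingersky recursion (a technique chosen here for illustration, not 
one named in the text). The response probabilities and function names are made up. 

```python
import math


def power_sums(probs, max_order):
    """Sums of the r-th powers of the response probabilities, r = 1..max_order."""
    return [sum(p ** r for p in probs) for r in range(1, max_order + 1)]


def score_distribution(probs):
    """Number-correct score distribution at a fixed ability value, built up
    one item at a time (Lord-Wingersky recursion)."""
    dist = [1.0]  # with zero items, score 0 has probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1 - p)   # item answered incorrectly
            new[score + 1] += mass * p     # item answered correctly
        dist = new
    return dist


# Hypothetical response probabilities at one ability value.
old_form = [0.2, 0.5, 0.8]
new_form = [0.8, 0.2, 0.5]  # same values, different item order

old_sums = power_sums(old_form, max_order=3)
new_sums = power_sums(new_form, max_order=3)
old_dist = score_distribution(old_form)
new_dist = score_distribution(new_form)
```

In this trivial case the two forms contain the same probabilities, so all power sums and the two 
score distributions agree up to rounding. In an assembly model, the low-order power sums of the 
old form would serve as targets at a few selected ability values. 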

Constrained adaptive testing. Though the development of computerized adaptive 
testing was motivated by the idea of maximizing the statistical precision of ability estimation, 
real-life applications have shown the need for such tests to keep the content specifications 
constant across examinees as well. A 0-1 LP approach to adaptive testing in which the 
information in the test is maximized at the current ability estimate subject to a large set of 
constraints is presented in the contribution by van der Linden and Reese (1998) to this special 
issue. The algorithm starts with the on-line assembly of a full test that meets 
each of the constraints and is optimal at the initial ability estimate. At each subsequent step, the 
most informative item from the test is administered and both the ability estimate and set of 
constraints are updated. An example for the LSAT shows that several hundred constraints can 
be built into the item selection procedure without sacrificing any precision of the ability 
estimator. A comparable approach based on network-flow programming was developed 
independently in Cordova (1997). An application of the algorithm with response-time 
constraints used to control adaptive tests for differential speededness between examinees is 
presented in van der Linden, Scrams and Schnipke (submitted). 
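
One step of such a procedure can be sketched as follows. This is a simplified illustration, not 
the algorithm of the cited contribution: it assumes a two-parameter logistic model, a single type 
of constraint (a maximum number of items per content area), and exhaustive search instead of a 
0-1 LP solver. The pool, the limits, and all function names are hypothetical, and the ability 
estimate is supplied by the caller rather than updated from responses. 

```python
import math
from itertools import combinations


def information(theta, a, b):
    """Fisher information of a 2PL item at ability theta (model assumed here)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def assemble_full_test(pool, administered, theta, length, max_per_area):
    """Best full-length test at theta that contains all administered items and
    respects the content limits; pool entries are (a, b, area)."""
    free = [i for i in range(len(pool)) if i not in administered]
    best, best_info = None, float("-inf")
    for extra in combinations(free, length - len(administered)):
        test = list(administered) + list(extra)
        counts = {}
        for i in test:
            counts[pool[i][2]] = counts.get(pool[i][2], 0) + 1
        if any(n > max_per_area.get(area, length) for area, n in counts.items()):
            continue
        total = sum(information(theta, pool[i][0], pool[i][1]) for i in test)
        if total > best_info:
            best, best_info = test, total
    return best


def next_item(pool, administered, theta, length, max_per_area):
    """Most informative not-yet-administered item in the current full test."""
    test = assemble_full_test(pool, administered, theta, length, max_per_area)
    candidates = [i for i in (test or []) if i not in administered]
    if not candidates:
        return None
    return max(candidates, key=lambda i: information(theta, pool[i][0], pool[i][1]))


# Hypothetical pool: (discrimination, difficulty, content area).
pool = [(1.5, 0.0, "A"), (1.4, 0.1, "A"), (1.3, -0.1, "A"),
        (0.8, 0.0, "B"), (0.7, 0.5, "B")]
limits = {"A": 2, "B": 2}
first = next_item(pool, [], 0.0, 3, limits)
second = next_item(pool, [first], 0.3, 3, limits)
third = next_item(pool, [first, second], 0.3, 3, limits)
```

Because every item is drawn from a full test that already satisfies the constraints, the third 
item in this example comes from content area B even though a more informative item from area A 
is still available. 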

Assembling multidimensional tests. For larger item pools, a potential problem with the 
use of the simple logistic IRT models for item calibration is violation of their assumption of 
unidimensionality. If so, a multidimensional IRT model has to be used. However, for a model 
with multiple ability parameters, test information is not a scalar, and the variance-covariance 
matrix of the estimators has to be addressed directly. Test assembly can then no longer follow 
Birnbaum's method based on a target for the test information function. 

A 0-1 LP-based algorithm for multidimensional test assembly is given in van der 
Linden (1996). The model is based on a target for the variance functions of the ability 
estimators using the fact that, though not linear in the items themselves, these functions are 
built up of linear expressions. In the model, some of these expressions are optimized, others 
constrained. Repeated application of the model systematically varying the bounds in the 
constraints can be used to find a solution fitting the targets for the information functions best. 
An example for an item pool from the ACT Assessment Program yielded test forms meeting a 
uniform target for the variance functions over the ability space. A version of the approach 
with Lagrangian relaxation is given in Veldkamp (submitted). 

Item pool design. The final application of optimal assembly methods in this review is 
to the problem of assembling an item pool. The importance of this application lies in 
the fact that item pools in testing programs are not always on target. As a consequence, some 
portions of the item pool are quickly depleted whereas others may have items that are never 
used. 

The problem of item pool design has been explored in Boekkooi-Timminga (1991). 
Her approach starts with a tentative blueprint for the item pool from which test forms are 
assembled to find out what types of items are over- and underrepresented. The results are then 
used to adjust the blueprint. Another approach is followed in van der Linden, Veldkamp and 
Veldkamp (in preparation). The decision variables in their integer programming model 
represent the numbers of items in the pool needed and optimal values for the variables are 
found using an objective function that minimizes an empirical estimate of the costs involved 
in item writing. 



Concluding Remark 

Modern measurement is characterized by the use of statistical models for the 
quantification of educational and psychological variables. As in any other quantitative field, 
an obvious next step is the application of optimization techniques to maximize the utility of 
the models. This special issue reviews a variety of applications of such techniques to the 
problem of optimal test assembly and presents several new applications. The mathematical 
techniques involved are neither new nor applied for the first time. However, what is new is the 
creativity involved in analyzing test assembly problems and structuring them such that the 
optimization techniques apply. Since most results are of recent date, it is anticipated that more 
will follow. 
Figure Captions 

Figure 1. Examples of targets for test information functions (1. Selection decision with cut 
score $\theta_0$; 2. Diagnostic test for low ability examinees; 3. Information function of an 
old test to be matched) 

Figure 2. Example of a test assembly model or program. 

Figure 3. A directed graph of a network-flow programming problem 
Maximize test information at cut score 

subject to 

1. No more than 10 items on knowledge of facts; 

2. At least 10 items on applications; 

3. Five items with graphics; 

4. Test length equal to 25 items; 

5. Total number of words in test not larger than 1,500; 

6. Total expected response time not larger than 60 minutes; 

7. Items 64 and 64 not simultaneously in the test. 